QS World University Rankings Analysis Project Milestone 1

Weike ZHANG, Ruoqin JI

May 2024

For more details, datasets, and analysis scripts, visit our GitHub webpage.

Project Outline¶

Introduction¶

The QS World University Rankings are a globally recognized framework for evaluating higher education institutions. This project will analyze ranking trends from 2022 to 2024 to uncover patterns and determinants of university performance. The findings will serve as an empirical guide for stakeholders in the education sector.

Objectives:¶

  • To identify trends and shifts in university rankings over the specified years.
  • To understand the impact of various performance metrics on the rankings.
  • To provide insights for educational institutions aiming to improve their standings.

Data and Summary Statistics¶

I. Data Sources (Extraction, Transform, and Load)¶

  • Description of the datasets for 2022, 2023, and 2024, including data structure and collection methodology.
  • Data size and completeness, with an emphasis on any data preprocessing conducted.

II. Summary Statistics¶

  • Computation of summary statistics for critical variables to establish a baseline understanding of the dataset's characteristics.

Measure and Variable Definition¶

  • In-depth explanation of QS ranking metrics.
  • Discussion on how each metric is quantified and its presumed influence on the overall rankings.

Exploratory Data Analysis (EDA)¶

I. Ranking Trends¶

  • Tracking shifts in rankings across the years and pinpointing outliers.
  • Identifying institutions with notable improvements or declines.

II. Metric Correlations¶

  • Investigating the interrelationship between ranking metrics using correlation analysis.
  • Visualizations to showcase the strength and direction of these relationships.

III. Geographic Trends¶

  • Geographic analysis of the distribution of top-ranked institutions.
  • Examination of regional performance and disparities.

IV. Internationalization¶

  • Evaluating the influence of international faculty and student presence on ranking outcomes.

Empirical Results¶

I. Regression Analysis¶

  • Linear regression models to estimate the effect of ranking metrics on the overall score.
  • Discussion of the model's assumptions, validations, and any transformations applied to the data.

II. Predictive Modelling¶

  • Developing predictive models to forecast future rankings based on identified trends.
  • Validation of predictive accuracy through back-testing with historical data.

Conclusion and Implications¶

  • Synthesis of key findings and their implications for universities and policymakers.
  • Discussion of the study's limitations and suggestions for further research.

Additional Sections:¶

  • Methodology: Detailed justification of statistical methods used.
  • Ethical Considerations: Reflection on the ethical aspects of ranking interpretations.
  • Peer Review: Strategy for peer review to validate findings.

Appendices:¶

  • Detailed tables, additional analyses, and a glossary of terms used throughout the project.

References:¶

  • Detailed bibliography citing data sources, literature, and methodologies referenced.

Data and Summary Statistics¶

I. Data Sources (Extraction, Transform, and Load)¶

QS World University Rankings

The QS World University Rankings provide a comprehensive evaluation of over 1,000 higher education institutions globally. Sourced from Quacquarelli Symonds (QS), these rankings are recognized worldwide for their depth of research and breadth of data regarding university performance. The datasets for 2022, 2023, and 2024, accessible through the QS website, form the primary basis of our analysis. These tables offer detailed insights into various performance metrics such as academic reputation, employer reputation, faculty-student ratio, citations per faculty, international faculty, and international students scores. By analyzing these datasets, we aim to uncover trends, evaluate shifts in rankings, and identify the determinants of university performance across the specified years.

  • View the QS World University Rankings 2022 Report
  • QS World University Rankings 2023 Result Tables - Excel
  • QS World University Rankings 2024 Results Table - Excel

QS World University Rankings Metrics Explained¶

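The QS methodology scores universities on several indicators, each capturing a distinct aspect of performance. The weights listed here are those QS applied through the 2023 edition; for the 2024 edition QS reduced the Academic Reputation weight to 30% and the Faculty Student weight to 10%, raised the Employer Reputation weight to 15%, and assigned 5% each to International Research Network, Employment Outcomes, and Sustainability. The indicators are: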

  • Academic Reputation Score (40% weight): Derived from a global academic survey, this score reflects the perceived research quality and academic standing of an institution.

  • Employer Reputation Score (10% weight): Based on a survey of employers, this score indicates the employability and preparedness of graduates in the workforce.

  • Faculty Student Score (20% weight): This metric measures the faculty-to-student ratio, providing insight into the teaching and learning environment of the university.

  • Citations per Faculty Score (20% weight): A measure of research impact, this score is calculated based on the average citations per faculty member, indicating research influence and quality.

  • International Faculty Score (5% weight): This score assesses the diversity of the faculty by measuring the proportion of international faculty members at the institution.

  • International Students Score (5% weight): Similarly, this score evaluates the diversity of the student body by looking at the percentage of international students.

  • Overall Score: A composite score that combines all individual metrics, representing a summarized assessment of a university's overall ranking performance.
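To make the weighting concrete, the following is a minimal sketch of a weighted composite using the weights listed above. The function name `weighted_composite` and the example scores are hypothetical and are not taken from the QS tables. The published Overall Score is labelled "score scaled" in the raw data, so a raw weighted sum like this one should not be expected to reproduce it exactly.

```python
# Indicator weights as listed above (they sum to 1.0).
WEIGHTS = {
    "Academic Reputation Score": 0.40,
    "Employer Reputation Score": 0.10,
    "Faculty Student Score": 0.20,
    "Citations per Faculty Score": 0.20,
    "International Faculty Score": 0.05,
    "International Students Score": 0.05,
}


def weighted_composite(scores, weights=WEIGHTS):
    """Return the weighted sum of indicator scores (each on a 0-100 scale)."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing indicator scores: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical scores for an illustrative institution (not QS data).
example_scores = {
    "Academic Reputation Score": 90.0,
    "Employer Reputation Score": 80.0,
    "Faculty Student Score": 70.0,
    "Citations per Faculty Score": 60.0,
    "International Faculty Score": 50.0,
    "International Students Score": 40.0,
}

composite = weighted_composite(example_scores)
print(round(composite, 2))
```

Because Academic Reputation carries twice the weight of any other indicator in this scheme, a 10-point change there moves the composite by 4 points, whereas the same change in either internationalization score moves it by only 0.5.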

In [1]:
import pandas as pd
import os
import sys
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [3]:
!git clone https://github.com/weike2001/ds
Cloning into 'ds'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 29 (delta 3), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (29/29), 921.93 KiB | 7.26 MiB/s, done.
Resolving deltas: 100% (3/3), done.
In [4]:
import pandas as pd

# Set the paths to the Excel files in the cloned repository
file_path_2022 = '/content/ds/data/2022_QS_World_University_Rankings_Results_public_version.xlsx'
file_path_2023 = '/content/ds/data/2023 QS World University Rankings V2.1 (For qs.com).xlsx'
file_path_2024 = '/content/ds/data/2024 QS World University Rankings 1.2 (For qs.com).xlsx'

# Read the data into pandas DataFrames
df_2022 = pd.read_excel(file_path_2022)
df_2023 = pd.read_excel(file_path_2023)
df_2024 = pd.read_excel(file_path_2024)

# Write each DataFrame to a CSV file next to its source Excel file
csv_file_path_2022 = file_path_2022.replace('.xlsx', '.csv')
csv_file_path_2023 = file_path_2023.replace('.xlsx', '.csv')
csv_file_path_2024 = file_path_2024.replace('.xlsx', '.csv')

# Save the DataFrames as CSV files
df_2022.to_csv(csv_file_path_2022, index=False)
df_2023.to_csv(csv_file_path_2023, index=False)
df_2024.to_csv(csv_file_path_2024, index=False)

# Preview the raw 2022 table; its first rows are leftover spreadsheet header rows
print(df_2022.head())
        Unnamed: 0         Unnamed: 1          2011     Unnamed: 3  \
0         NATIONAL           REGIONAL          2022           2021   
1             RANK               RANK          RANK           RANK   
2  rank in country  rank in subregion  rank display  rank display2   
3                1                  1           1              1     
4                1                  1           2              5     

                                     Unnamed: 4    Unnamed: 5  \
0                              Institution Name      Location   
1                                           NaN          CODE   
2                                   institution  country code   
3  Massachusetts Institute of Technology (MIT)             US   
4                          University of Oxford            UK   

            Unnamed: 6      Unnamed: 7 Unnamed: 8 Unnamed: 9  ... Unnamed: 15  \
0                  NaN  Classification        NaN        NaN  ...         NaN   
1  COUNTRY / TERRITORY            SIZE      FOCUS       RES.  ...        RANK   
2              country            size      focus   research  ...     er rank   
3        United States               M         CO         VH  ...           4   
4       United Kingdom               L         FC         VH  ...           3   

       Unnamed: 16 Unnamed: 17            Unnamed: 18 Unnamed: 19  \
0  Faculty Student         NaN  Citations per Faculty         NaN   
1            SCORE        RANK                  SCORE        RANK   
2        fsr score    fsr rank              cpf score    cpf rank   
3              100          12                    100           6   
4              100           5                     96          34   

             Unnamed: 20 Unnamed: 21             Unnamed: 22 Unnamed: 23  \
0  International Faculty         NaN  International Students         NaN   
1                  SCORE        RANK                   SCORE        RANK   
2              ifr score    ifr rank               isr score    isr rank   
3                    100          45                    91.4         105   
4                   99.5          83                    98.5          52   

    Unnamed: 24  
0       Overall  
1         SCORE  
2  score scaled  
3           100  
4          99.5  

[5 rows x 25 columns]

Assign clean column names to each CSV file

In [5]:
import pandas as pd

# Define the new specific column names
specific_column_names_2022 = [
    'National Rank', 'Regional Rank', '2022 Rank', '2021 Rank', 'Institution Name',
    'Location Code', 'Country/Territory', 'Size', 'Focus', 'Research Intensity',
    'Age Band', 'Status', 'Academic Reputation Score', 'Academic Reputation Rank',
    'Employer Reputation Score', 'Employer Reputation Rank', 'Faculty Student Score',
    'Faculty Student Rank', 'Citations per Faculty Score', 'Citations per Faculty Rank',
    'International Faculty Score', 'International Faculty Rank', 'International Students Score',
    'International Students Rank', 'Overall Score'
]

specific_column_names_2023 = [
    '2023 Rank', '2022 Rank', 'Institution Name', 'Location Code', 'Country/Territory',
    'Size', 'Focus', 'Research Intensity', 'Age Band', 'Status',
    'Academic Reputation Score', 'Academic Reputation Rank',
    'Employer Reputation Score', 'Employer Reputation Rank',
    'Faculty Student Score', 'Faculty Student Rank',
    'Citations per Faculty Score', 'Citations per Faculty Rank',
    'International Faculty Score', 'International Faculty Rank',
    'International Students Score', 'International Students Rank',
    'International Research Network Score', 'International Research Network Rank',
    'Employment Outcomes Score', 'Employment Outcomes Rank',
    'Overall Score'
]

specific_column_names_2024 = [
    '2024 Rank', '2023 Rank', 'Institution Name', 'Location Code', 'Country/Territory',
    'Size', 'Focus', 'Research Intensity', 'Status',
    'Academic Reputation Score', 'Academic Reputation Rank',
    'Employer Reputation Score', 'Employer Reputation Rank',
    'Faculty Student Score', 'Faculty Student Rank',
    'Citations per Faculty Score', 'Citations per Faculty Rank',
    'International Faculty Score', 'International Faculty Rank',
    'International Students Score', 'International Students Rank',
    'International Research Network Score', 'International Research Network Rank',
    'Employment Outcomes Score', 'Employment Outcomes Rank',
    'Sustainability Score', 'Sustainability Rank',
    'Overall Score'
]

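# Sanity check: number of column names defined for the 2024 table
print(len(specific_column_names_2024))

# Re-read the CSV files, skipping the first 4 lines (for 2022, the pandas-generated
# header plus the three spreadsheet header rows shown above) and applying the clean names
df_2022 = pd.read_csv(csv_file_path_2022, skiprows=4, names=specific_column_names_2022)
df_2023 = pd.read_csv(csv_file_path_2023, skiprows=4, names=specific_column_names_2023)
df_2024 = pd.read_csv(csv_file_path_2024, skiprows=4, names=specific_column_names_2024)

# Only the last expression in a notebook cell is displayed, so the output shows 2024
df_2022.head()
df_2023.head()
df_2024.head()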
28
Out[5]:
   2024 Rank  2023 Rank                             Institution Name  Location Code  Country/Territory  \
0          1          1  Massachusetts Institute of Technology (MIT)             US      United States   
1          2          2                      University of Cambridge             UK     United Kingdom   
2          3          4                         University of Oxford             UK     United Kingdom   
3          4          5                           Harvard University             US      United States   
4          5          3                          Stanford University             US      United States   

   Size  Focus  Research Intensity  Status  Academic Reputation Score  ...  \
0     M     CO                  VH       B                      100.0  ...   
1     L     FC                  VH       A                      100.0  ...   
2     L     FC                  VH       A                      100.0  ...   
3     L     FC                  VH       B                      100.0  ...   
4     L     FC                  VH       B                      100.0  ...   

   International Faculty Rank  International Students Score  International Students Rank  \
0                          56                          88.2                          128   
1                          64                          95.8                           85   
2                         110                          98.2                           60   
3                         210                          66.8                          223   
4                          78                          51.2                          284   

   International Research Network Score  International Research Network Rank  \
0                                  94.3                                   58   
1                                  99.9                                    7   
2                                 100.0                                    1   
3                                 100.0                                    5   
4                                  95.8                                   44   

   Employment Outcomes Score  Employment Outcomes Rank  Sustainability Score  Sustainability Rank  Overall Score  
0                      100.0                         4                  95.2                   51            100  
1                      100.0                         6                  97.3                  33=           99.2  
2                      100.0                         3                  97.8                  26=           98.9  
3                      100.0                         1                  96.7                   39           98.3  
4                      100.0                         2                  94.4                   63           98.1  

5 rows × 28 columns
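The effect of `skiprows` and `names` in the cell above can be seen on a small in-memory CSV. This is only an illustrative sketch: the rows below are made up to mimic the layout of the converted files (one pandas-generated header line followed by three leftover spreadsheet header rows), and the three column names are chosen for the example.

```python
import io

import pandas as pd

# Made-up CSV mimicking the converted files: one pandas-generated header line,
# three leftover header rows from the spreadsheet, then the data rows.
raw_csv = io.StringIO(
    "Unnamed: 0,Unnamed: 1,Unnamed: 2\n"
    "2024,Institution Name,Overall\n"
    "RANK,,SCORE\n"
    "rank display,institution,score scaled\n"
    "1,University A,100\n"
    "2,University B,99.2\n"
)

# skiprows=4 drops the four header lines; names= supplies clean column labels.
toy = pd.read_csv(
    raw_csv,
    skiprows=4,
    names=["Rank", "Institution Name", "Overall Score"],
)
print(toy)
```

Passing `names` tells pandas that the file has no header row of its own, so the first line after the skipped ones is read as data.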

In this section, we focus on preparing the 'Overall Score' data from the QS World University Rankings for 2022, 2023, and 2024. The preparation involves two key steps:

  1. Replacing Missing Values: We convert missing values, originally represented as hyphens ('-'), to NaN (Not a Number) to standardize the dataset for numerical analysis.
  2. Converting to Numeric: The 'Overall Score' column is converted from string type to floating-point numbers, facilitating statistical operations and analysis.

Objectives:

  • Clean and standardize the data for accurate analysis.
  • Enable computation of descriptive statistics and facilitate trend analysis across years.
  • Assess the completeness of the data to ensure robust analytical outcomes.

This data preparation is essential for analyzing global university ranking trends and setting the stage for further in-depth examination of university performances.

In [6]:
import pandas as pd
import numpy as np

# Replace hyphens with NaN and convert the column to numeric
df_2022['Overall Score'] = pd.to_numeric(df_2022['Overall Score'].replace('-', np.nan), errors='coerce')
df_2023['Overall Score'] = pd.to_numeric(df_2023['Overall Score'].replace('-', np.nan), errors='coerce')
df_2024['Overall Score'] = pd.to_numeric(df_2024['Overall Score'].replace('-', np.nan), errors='coerce')

# 'Overall Score' is now a float column, with NaN wherever the source had a hyphen ('-')
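
The completeness objective listed above can be checked in the same pass. The following is a minimal sketch on a made-up four-row table (the values are illustrative, not QS data). It repeats the hyphen-to-NaN conversion and then reports how many rows carry a usable score.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one year's table; the values are made up for illustration.
toy = pd.DataFrame({
    "Institution Name": ["A", "B", "C", "D"],
    "Overall Score": ["100", "87.5", "-", "-"],
})

# Same two steps as above: hyphen -> NaN, then string -> float.
toy["Overall Score"] = pd.to_numeric(
    toy["Overall Score"].replace("-", np.nan), errors="coerce"
)

# Completeness: how many rows have a usable score, and what share is missing.
n_present = int(toy["Overall Score"].notna().sum())
share_missing = float(toy["Overall Score"].isna().mean())
print(f"{n_present} of {len(toy)} rows have an Overall Score "
      f"({share_missing:.0%} missing)")
```

Running the same two lines on `df_2022`, `df_2023`, and `df_2024` would show how many institutions per year have an Overall Score, which matters when comparing yearly averages.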

II. Summary Statistics¶

In our analysis of the QS World University Rankings datasets for 2022 to 2024, we focus on six score columns that feed into, or summarize, a university's ranking:

  • Academic Reputation Score: A gauge of a university's academic eminence as recognized by peers.
  • Employer Reputation Score: A reflection of the institution's graduate employability and readiness for the professional world.
  • Citations per Faculty Score: An index of research influence and scholarly impact.
  • International Faculty Score: A measure of the institution's success in fostering a diverse and global faculty.
  • International Students Score: An indicator of the university's ability to attract a worldwide student body.
  • Overall Score: A comprehensive score that embodies all individual metrics, offering a summarized assessment of a university's worldwide standing and performance.

For each of these metrics, we compute the mean, standard deviation, median, minimum, and maximum for every year. Together these describe the typical level, spread, and range of scores, and give a baseline for the trend analysis that follows.

In [7]:
import pandas as pd

# Assuming df_2022, df_2023, and df_2024 have already been loaded

# Selected metrics to compute summary statistics
selected_metrics = [
    'Academic Reputation Score', 'Employer Reputation Score',
    'Citations per Faculty Score', 'International Faculty Score',
    'International Students Score', 'Overall Score'
]

# Function to calculate and print selected descriptive statistics
def print_selected_statistics(df, year, metrics):
    print(f"Selected Descriptive Statistics for {year}:")
    stats = df[metrics].describe().loc[['mean', 'std', 'min', '50%', 'max']]
    print(stats, "\n")  # Rows: mean, standard deviation, min, median (50%), max

# Call the function for each year's DataFrame
print_selected_statistics(df_2022, "2022", selected_metrics)
print_selected_statistics(df_2023, "2023", selected_metrics)
print_selected_statistics(df_2024, "2024", selected_metrics)
Selected Descriptive Statistics for 2022:
      Academic Reputation Score  Employer Reputation Score  \
mean                  21.552462                  22.193000   
std                   23.315627                  24.535947   
min                    1.000000                   1.000000   
50%                   11.900000                  11.950000   
max                  100.000000                 100.000000   

      Citations per Faculty Score  International Faculty Score  \
mean                    26.293308                    26.503746   
std                     28.299027                    35.429502   
min                      1.000000                     1.000000   
50%                     13.400000                     5.400000   
max                    100.000000                   100.000000   

      International Students Score  Overall Score  
mean                     28.119059      44.767066  
std                      31.211629      18.961269  
min                       1.000000      24.100000  
50%                      13.200000      38.600000  
max                     100.000000     100.000000   

Selected Descriptive Statistics for 2023:
      Academic Reputation Score  Employer Reputation Score  \
mean                  20.124684                  20.657143   
std                   22.802706                  24.027928   
min                    1.000000                   1.000000   
50%                   10.800000                  10.300000   
max                  100.000000                 100.000000   

      Citations per Faculty Score  International Faculty Score  \
mean                    24.529358                    31.659517   
std                     27.910952                    34.170817   
min                      1.000000                     1.000000   
50%                     11.100000                    13.750000   
max                    100.000000                   100.000000   

      International Students Score  Overall Score  
mean                     26.545348      44.619400  
std                      30.896854      18.655057  
min                       1.000000      24.200000  
50%                      10.800000      38.550000  
max                     100.000000     100.000000   

Selected Descriptive Statistics for 2024:
      Academic Reputation Score  Employer Reputation Score  \
mean                  20.132043                  19.806880   
std                   22.365895                  23.764625   
min                    1.600000                   1.000000   
50%                   10.900000                   9.500000   
max                  100.000000                 100.000000   

      Citations per Faculty Score  International Faculty Score  \
mean                    23.940163                    30.948834   
std                     28.075573                    34.247562   
min                      1.000000                     1.000000   
50%                     10.400000                    13.050000   
max                    100.000000                   100.000000   

      International Students Score  Overall Score  
mean                     25.575035      40.879900  
std                      30.867149      19.181335  
min                       1.000000      19.800000  
50%                       9.850000      34.550000  
max                     100.000000     100.000000   
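
To compare years side by side rather than reading three separate printouts, the per-year summaries could be stacked into one table. The following is a minimal sketch on made-up data; the dictionary `toy_by_year` stands in for the three yearly DataFrames and its values are not QS scores.

```python
import pandas as pd

# Toy stand-ins for two yearly tables; the values are made up for illustration.
toy_by_year = {
    "2022": pd.DataFrame({"Overall Score": [100.0, 80.0, 60.0]}),
    "2023": pd.DataFrame({"Overall Score": [100.0, 70.0, 40.0]}),
}

# Compute the same five statistics per year, then stack them with the year as key.
wanted = ["mean", "std", "min", "50%", "max"]
summary = pd.concat(
    {year: df.describe().loc[wanted] for year, df in toy_by_year.items()},
    axis=1,
)

# Columns are now a (year, metric) MultiIndex, so one metric can be
# pulled out and compared across years.
overall_by_year = summary.xs("Overall Score", axis=1, level=1)
print(overall_by_year)
```

With the real data, the same pattern would place the 2022, 2023, and 2024 values of any metric in adjacent columns, which makes year-over-year shifts easier to spot.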

Measure and Variable Definition¶

This section examines the metrics behind the QS World University Rankings, taking each component in turn to show how universities are evaluated and ranked.

The QS ranking framework combines several metrics, each quantifying a distinct aspect of university performance. The metrics and the weights QS applied through the 2023 edition are:

  • Academic Reputation Score (40%): Derived from a global survey, reflecting the university's standing in the academic community.
  • Employer Reputation Score (10%): Based on employer surveys, indicating the quality and employability of the institution's graduates.
  • Faculty/Student Ratio (20%): A metric that assesses the faculty-to-student ratio, providing insights into the educational environment.
  • Citations per Faculty (20%): This measures the average number of citations per faculty member, serving as an indicator of research impact.
  • International Faculty Score (5%) and International Students Score (5%): Both these scores evaluate the university's internationalization by measuring the diversity of faculty and student bodies.

The Overall Score is a composite of these individual metrics and determines the university's rank.

In This Section:¶

  • Analyze each metric in detail, covering its data source, methodology, and computation.
  • Discuss the weight each metric carries and its hypothesized impact on the overall ranking.
  • Compare universities to identify strengths and weaknesses on each metric.
  • Reflect on how the metrics and their definitions have changed over time, and what that says about how higher education quality is assessed.

Through this deep dive into the QS ranking metrics, we seek to elucidate the nuances that underpin university rankings, providing a clear guide for institutions aiming to enhance their global standing.

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming the dataframes df_2022, df_2023, and df_2024 have already been loaded

# Metric weights per the QS methodology used through the 2023 edition
# (the 2024 edition reweighted these indicators and added three more)
qs_metrics_weights = {
    'Academic Reputation Score': 0.40,
    'Employer Reputation Score': 0.10,
    'Faculty Student Score': 0.20,
    'Citations per Faculty Score': 0.20,
    'International Faculty Score': 0.05,
    'International Students Score': 0.05
}

# Function to analyze and plot each metric
def analyze_qs_metrics(df, year):
    print(f"Analyzing QS Ranking Metrics for {year}")
    for metric, weight in qs_metrics_weights.items():
        df[metric] = pd.to_numeric(df[metric], errors='coerce')  # Ensure the metric is numeric
        # Plot the distribution of each metric
        plt.figure(figsize=(10, 6))
        sns.histplot(df[metric].dropna(), kde=True)
        plt.title(f"Distribution of {metric} in {year}")
        plt.xlabel(metric)
        plt.ylabel('Frequency')
        plt.show()
        # Print the weight of the metric
        print(f"{metric} has a weight of {weight*100}% in the overall ranking.\n")

# Analyze metrics for each year
analyze_qs_metrics(df_2022, '2022')
analyze_qs_metrics(df_2023, '2023')
analyze_qs_metrics(df_2024, '2024')
Analyzing QS Ranking Metrics for 2022
Academic Reputation Score has a weight of 40.0% in the overall ranking.

Employer Reputation Score has a weight of 10.0% in the overall ranking.

Faculty Student Score has a weight of 20.0% in the overall ranking.

Citations per Faculty Score has a weight of 20.0% in the overall ranking.

International Faculty Score has a weight of 5.0% in the overall ranking.

International Students Score has a weight of 5.0% in the overall ranking.

Analyzing QS Ranking Metrics for 2023
Academic Reputation Score has a weight of 40.0% in the overall ranking.

Employer Reputation Score has a weight of 10.0% in the overall ranking.

Faculty Student Score has a weight of 20.0% in the overall ranking.

Citations per Faculty Score has a weight of 20.0% in the overall ranking.

International Faculty Score has a weight of 5.0% in the overall ranking.

International Students Score has a weight of 5.0% in the overall ranking.

Analyzing QS Ranking Metrics for 2024
Academic Reputation Score has a weight of 40.0% in the overall ranking.

Employer Reputation Score has a weight of 10.0% in the overall ranking.

Faculty Student Score has a weight of 20.0% in the overall ranking.

Citations per Faculty Score has a weight of 20.0% in the overall ranking.

International Faculty Score has a weight of 5.0% in the overall ranking.

International Students Score has a weight of 5.0% in the overall ranking.

Exploratory Data Analysis (EDA)¶

I. Ranking Trends¶

  • Tracking shifts in rankings across the years and pinpointing outliers.
  • Identifying institutions with notable improvements or declines.
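
Each yearly table carries both the current and the previous edition's rank, so year-over-year movement can be computed within a single table. The following is a minimal sketch on made-up rows. The helper `leading_rank` is hypothetical, and it assumes rank strings may carry a tie marker such as `12=` or a band such as `601-650`, in which case only the leading number is used.

```python
import pandas as pd

# Toy rows shaped like the 2024 table, which holds both the current and the
# previous edition's rank. Names and ranks here are made up for illustration.
toy = pd.DataFrame({
    "Institution Name": ["University A", "University B", "University C", "University D"],
    "2024 Rank": ["1", "12=", "150", "601-650"],
    "2023 Rank": ["2", "30", "95", "601-650"],
})


def leading_rank(series):
    """Extract the leading integer from rank strings such as '12=' or '601-650'."""
    digits = series.astype(str).str.extract(r"^(\d+)", expand=False)
    return pd.to_numeric(digits, errors="coerce")


# Positive change = moved up the table (smaller rank number than the year before).
toy["Rank Change"] = leading_rank(toy["2023 Rank"]) - leading_rank(toy["2024 Rank"])

biggest_gain = toy.sort_values("Rank Change", ascending=False).iloc[0]
biggest_drop = toy.sort_values("Rank Change", ascending=True).iloc[0]
print(toy[["Institution Name", "Rank Change"]])
```

Using the lower bound of a band means that movement inside a band shows up as zero, so changes for banded institutions should be read as coarse estimates.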

Geographic Distribution of QS Ranked Universities¶

To gain a deeper understanding of the global landscape of higher education as reflected in the QS World University Rankings, we employ choropleth maps to visualize the distribution of ranked universities by country for the years 2022, 2023, and 2024. This geographic analysis allows us to observe trends, patterns, and potentially the regional dynamics influencing higher education excellence on a global scale.

The function create_choropleth_map is crafted to:

  1. Count Universities by Country: For each year, it calculates the number of universities within each country that appear in the QS rankings.
  2. Generate a Choropleth Map: Utilizing Plotly Express, it creates an interactive map highlighting countries based on the count of their ranked universities. The intensity of the color corresponds to the number of universities, providing a clear visual representation of higher education hubs worldwide.

Here's a brief overview of the function and its application:

In [9]:
def configure_plotly_browser_state():
    # Load require.js and point it at the Plotly CDN so figures can render in the notebook
    from IPython.display import HTML, display
    display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
            requirejs.config({
            paths: {
                base: '/static/base',
                plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
            });
        </script>
        '''))
In [10]:
def enable_plotly_in_cell():
  # Load require.js, then initialise Plotly's offline notebook mode
  from IPython.display import HTML, display
  from plotly.offline import init_notebook_mode
  display(HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)
In [12]:
import pandas as pd
import plotly.express as px
import plotly

#enable_plotly_in_cell()
def create_choropleth_map(dataframe, column_name, title):

    # Count ranked universities per country as a two-column DataFrame
    df_counts = (
        dataframe[column_name]
        .value_counts()
        .rename_axis('Country')
        .reset_index(name='University_Count')
    )
    # Create the choropleth map
    fig = px.choropleth(df_counts,
                        locations="Country",
                        locationmode='country names',
                        color="University_Count",
                        color_continuous_scale=px.colors.sequential.Reds,  # Reds color scale
                        title=title)

    # Update the layout
    fig.update_layout(
        geo=dict(
            showframe=False,
            showcoastlines=False,
            projection_type='equirectangular'
        )
    )
    #configure_plotly_browser_state()
    # Show the figure
    fig.show(renderer="notebook")

# Use the function with your DataFrame and column
create_choropleth_map(df_2022, 'Country/Territory', 'Number of Universities per Country in 2022')
create_choropleth_map(df_2023, 'Country/Territory', 'Number of Universities per Country in 2023')
create_choropleth_map(df_2024, 'Country/Territory', 'Number of Universities per Country in 2024')

Reference¶

Our Excel files come from the links below:

  • View the QS World University Rankings 2022 Report
  • QS World University Rankings 2023 Result Tables - Excel
  • QS World University Rankings 2024 Results Table - Excel